Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Grammars Compiler #3915

Merged
merged 15 commits into from
Nov 30, 2017
Merged

New Grammars Compiler #3915

merged 15 commits into from
Nov 30, 2017

Conversation

vmg
Copy link
Contributor

@vmg vmg commented Nov 22, 2017

Alright! Here's the new grammars compiler I've been working on. The old Ruby script/convert-grammars has been replaced by an actual compiler written in Go. The source code can be found in tools/grammars/.

To note:

  • The compiler is being distributed as a Docker image, linguist/grammars-compiler:latest on Docker.io. The image is self-contained (duh) and includes all the dependencies, most notably NPM and the CSON parser it needs to run. package.json has been removed from this repository because eek.

  • The new script/grammars-compiler is a thin wrapper over running the Docker image with the arguments you pass in.

  • The main goal of the compiler is ensuring that all the grammars we add to Linguist are sensible and not malformed. This was not the case previously. I've had to manually send PRs to many upstream repositories with different fixes for issues that the compiler has detected. Their corresponding submodules have been upgraded in d08375d

  • The compiler also normalizes the grammars into a fixed schema, removing superfluous keys and unifying content structures. This schema is serializable to JSON (like the old compiler) or to a Protobuf library, which we're going to start using internally.

  • As it stands, this build is now green, although the compilation step is not fully "clean" yet. I'm going to open another issue with a list of outstanding issues in the existing grammars.

cc @github/linguist

  • The new compiler can output

@vmg
Copy link
Contributor Author

vmg commented Nov 22, 2017

Here are all the known issues in existing grammars that the compiler is detecting:

$ ./script/grammar-compiler 
 299 / 299 [===================================================================================================================] 100.00% 7s
done! 437 scopes converted from 299 grammars (34 errors)

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/sublime-spintools/Spin.tmLanguage
	keys: [bundleUUID]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/pike-textmate/Pike.tmbundle/Syntaxes/Pike.plist
	keys: [Patterns[5].swallow Patterns[6].swallow foregroundColor backgroundColor increaseIndentPattern]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/idl.tmbundle/Syntaxes/IDL.tmLanguage
	keys: [isDisabled]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/c.tmbundle/Syntaxes/Platform.tmLanguage
	keys: [hideFromUser]

- unknown keys in grammar
	file: Clams-sublimesystemverilog-8c37a53ac265/SystemVerilog.tmLanguage
	keys: [hidden]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/sublime-glsl/GLSL.tmLanguage
	keys: [Patterns[3].beginCapture]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/sublime-MuPAD/MuPAD.tmLanguage
	keys: [highlightPairs smartTypingPairs]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/Lean.tmbundle/Syntaxes/Lean.tmLanguage
	keys: [Patterns[3].pattern]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/TypeScript-TmLanguage/TypeScript.YAML-tmLanguage
	keys: [variables]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/language-jison/grammars/jisonlex-injection.cson
	keys: [injectionSelector]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/language-typelanguage/grammars/typelanguage.tmLanguage.json
	keys: [$schema]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/MagicPython/grammars/MagicPython.cson
	keys: [first_line_match]

- unknown keys in grammar
	file: /src/linguist/vendor/grammars/objective-c.tmbundle/Syntaxes/Platform.tmLanguage
	keys: [hideFromUser]

- scope 'comment.block' cannot be found in library
	included from grammar: text.html.mako

- scope 'source.smarty' cannot be found in library
	included from grammar: text.html.basic

- scope 'source.js.jquery' cannot be found in library
	included from grammar: source.scaml

- scope 'source.ruby.rails' cannot be found in library
	included from grammar: text.slim

- scope 'meta.scope.any.mathematica' cannot be found in library
	included from grammar: source.mathematica

- scope 'source.smarty' cannot be found in library
	included from grammar: file.lasso

- scope 'source.ruby.rails' cannot be found in library
	included from grammar: source.yaml

- scope 'text.html.textile' cannot be found in library
	included from grammar: source.shell

- scope 'source.ruby.rails' cannot be found in library
	included from grammar: text.html.liquid

- scope 'source.js.jquery' cannot be found in library
	included from grammar: source.crystal

- scope 'source.ruby.rails' cannot be found in library
	included from grammar: text.haml

- scope 'text.slm' cannot be found in library
	included from grammar: text.html.vue

- scope 'text.pug' cannot be found in library
	included from grammar: text.html.vue

- scope 'source.postcss' cannot be found in library
	included from grammar: text.html.vue

- scope 'source.ruby.rails' cannot be found in library
	included from grammar: source.yaml-ext

- scope 'source.js.jquery' cannot be found in library
	included from grammar: source.ruby

- scope 'source.AtLilyPond' cannot be found in library
	included from grammar: text.roff

- scope 'source.asm' cannot be found in library
	included from grammar: source.nim

- scope 'text.html.markdown.multimarkdown' cannot be found in library
	included from grammar: source.nim

- scope 'js-expression' cannot be found in library
	included from grammar: source.mask

- scope 'meta.define.event' cannot be found in library
	included from grammar: source.netlinx.erb

As you can see, all the grammar now parse successfully. The main remaining issues are regarding keys which we don't know how to handle, and missing dependencies.

For the unknown keys, there's already a whitelist in the compiler with keys which we don't handle but are "acceptable" (namely, comment, uuid and things like that which we don't need to load the grammar). The remaining reported keys are just weird stuff -- and some of them typos.

The dependencies are a trickier thing. Lots of grammars in the existing library attempt to include grammars that don't exist. The biggest offender was javadoc, which I've manually added in this PR because lots of grammars use it. The other ones would need to be added too, or blacklisted (when blacklisting includes, the compiler will fix the grammar so it doesn't attempt to bring them in).

@@ -1,10 +1,9 @@
---
type: grammar
name: ruby.tmbundle
name: javadoc.tmbundle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh!? Why's this happening? (I've not looked too deeply or played with this, just noticed it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is two issues in one:

  1. There was a ruby.tmbundle license in the Licenses folder which was just outdated (since we're now using the Atom Ruby package, which is not called ruby.tmbundle). When I ran the license checker, it removed it.

  2. I added the new Javadoc grammar, which happens to have a license identical to ruby.tmbundle (and to most other Textmate grammars, they all share the same license).

  3. Git saw a removed license, and a new added one, and marked it as a rename (h e u r i s t i c s) instead of an outdated license removed and a new one added.

@@ -84,13 +84,13 @@ if repo_old
log "Deregistering: #{repo_old}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move scripts to the new tools/ directory or is there a clear distinction between the two (besides the fact that one contains scripting language programs)? If the idea is to always use tools through scripts maybe tools/ should be a subdirectory (with a new name?) in script/...?

@pchaigno
Copy link
Contributor

Thanks for this @vmg!

For the remaining issues in grammars, do you want to fix them all before merging this? Do you need help to report them upstream and try to fix them?

I'm not very familiar with Go, but is it necessary to commit the dependencies (tools/grammars/vendor/)? In particular, since there's a Gopkg.toml file, I though dep could use it to retrieve dependencies.

@Alhadis
Copy link
Collaborator

Alhadis commented Nov 27, 2017

Excellent work! 😀 👍 Does the compiler handle any PCRE-incompatible regex extensions, such as Oniguruma's \h escape? If not, it'd be awesome if it did. \h is one particularly problematic case which surfaces often in many grammars. In PCRE/Perl, \h matches a horizontal whitespace character (space or tab, but not a newline). In Oniguruma, \h matches a hexadecimal digit ([A-Fa-f0-9]).

There are other incompatibilities between regular expression engines, but I'd say the matter of \h is the most glaring one, and certainly the one most likely to appear with future updates upstream.

@vmg
Copy link
Contributor Author

vmg commented Nov 28, 2017

Thanks @pchaigno @Alhadis!

For the remaining issues in grammars, do you want to fix them all before merging this?

They are non-blocking issues, since the compiler can work around them (e.g. by removing the unknown dependencies), so I'll merge this before they are fully fixed.

Do you need help to report them upstream and try to fix them?

I would love that. I don't have the time to fix them myself like I did for the parsing errors because I'm working on the backend this week.

I'm not very familiar with Go, but is it necessary to commit the dependencies (tools/grammars/vendor/)?

It is not necessary, but it is customary. Since we're distributing a pre-built Docker image, I think I can just not commit all these code. I'll remove it!

Does the compiler handle any PCRE-incompatible regex extensions, such as Oniguruma's \h escape?

It does not, and this is a fantastic suggestion I hadn't thought of. I'll try to work on this today.

@lildude
Copy link
Member

lildude commented Nov 28, 2017

I've not played with this yet, but one thing that did cross my mind is this adds a new little barrier to entry when adding new grammars as it now requires Docker be installed.

We definitely need to add this requirement to the CONTRIBUTING.md file

@vmg
Copy link
Contributor Author

vmg commented Nov 28, 2017

this adds a new little barrier to entry when adding new grammars as it now requires Docker be installed

It does remove the requirement of having Node.js and NPM installed, so overall I'm not concerned about bringing in a Docker dependency.

@Alhadis
Copy link
Collaborator

Alhadis commented Nov 28, 2017

For the remaining issues in grammars, do you want to fix them all before merging this? Do you need help to report them upstream and try to fix them?

I better explain why this one is happening, since it's coming from one of my grammars:

- scope 'source.AtLilyPond' cannot be found in library
	included from grammar: text.roff

source.AtLilyPond is the scopeName used by Atom's LilyPond package,
while source.lilypond is the scopeName used by TextMate's bundle (which GitHub/Linguist is using).
To enable embedded LilyPond highlighting in both Atom and on GitHub, I simply included them both.
Whichever grammar is available will be used to match the sembedded LilyPond snippet. So I'd say this error should be ignored.

On a related note, see the recent discussion over at Atom concerning their new grammar-parsing system that's nearing completion soon... you might need to update the list of unrecognised properties at some point.

@lildude
Copy link
Member

lildude commented Nov 29, 2017

Does the compiler handle any PCRE-incompatible regex extensions, such as Oniguruma's \h escape?

It does not, and this is a fantastic suggestion I hadn't thought of. I'll try to work on this today.

Does that mean we can now update to the most recent version of https://github.com/gandm/language-babel and finally deal with #3044?

Will this also help with russian characters as detailed in #3291?

@vmg
Copy link
Contributor Author

vmg commented Nov 29, 2017

Does that mean we can now update to the most recent version of https://github.com/gandm/language-babel and finally deal with #3044?

Looking at the output of the compiler, it seems like the JSX grammar we're currently running (i.e. the pinned fork) is also broken. I tried upgrading to upstream and the grammar has the same (trivially fixable) issues:

Lots of grammar rules use Oniguruma-specific syntax \k<-1>, which according to the Oniguruma docs, means relative group backreferences. As an example \k<-3> would be "a backreference to the 3rd group before this position" (counting backwards that is). The way to make this work in any regexp engine is to actually give a name to the group you're pointing at and using \k<name>. That also has the added benefit of making the regexp look slightly less insane.

It seems like a trivial enough fix if somebody would like to attempt to upstream it. It would definitely be backwards compatible, and I think it'd allow us to switch back to master.

Will this also help with russian characters as detailed in #3291?

This is an unrelated issue, I'm afraid. PCRE supports the syntax just fine, it just doesn't interpret characters in the extended Unicode range like Russian.

@vmg
Copy link
Contributor Author

vmg commented Nov 29, 2017

Will this also help with russian characters as detailed in #3291?

In brighter news, I've just tried enabling support for Unicode properties and I don't see a significant performance regression -- less than 5%. We may be able to leave them turned on as we roll out the new service.

@vmg vmg mentioned this pull request Nov 30, 2017
17 tasks
Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WFM locally without any errors. ✨

@vmg vmg merged commit e335d48 into master Nov 30, 2017
@vmg
Copy link
Contributor Author

vmg commented Nov 30, 2017

Following up on #3924 with all the issues that the compiler detected.

@lildude
Copy link
Member

lildude commented Dec 2, 2017

Just played with this again today and thought of #3924 whilst doing so.

I think it would be useful for the long-term health of Linguist and the grammars if we output the grammar issues for the grammar that is being added so peeps know right away that we've detected problems.

What do you think @vmg?

@lildude lildude deleted the vmg/fixed-grammars branch December 2, 2017 12:12
@vmg
Copy link
Contributor Author

vmg commented Dec 4, 2017

@lildude: As far as I can tell, this is the case. When calling script/add-grammar, the compiler will be ran and if the grammar has any outstanding issues they will be reported. Is this functionality broken?

@lildude
Copy link
Member

lildude commented Dec 4, 2017

@vmg I've not tested with adding a new grammar, but I did test with replacing a grammar (our longest serving desirable):

$ script/add-grammar --replace language-babel https://github.com/gandm/language-babel
cp lib/linguist/samples.json tmp/x86_64-darwin16/stage/lib/linguist/samples.json
install -c tmp/x86_64-darwin16/linguist/2.4.0/linguist.bundle lib/linguist/linguist.bundle
cp tmp/x86_64-darwin16/linguist/2.4.0/linguist.bundle tmp/x86_64-darwin16/stage/lib/linguist/linguist.bundle
$

After doing the above, I ran script/bootstrap (to ensure new submodule is pulled down and used) and then when I run script/grammar-compiler, I see this grammar has problems:

[...]
- [ ] repository `vendor/grammars/language-babel` (from https://github.com/gandm/language-babel) (14 errors)
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(?<=^|{|,)\s*+(\b[_$a-zA-Z][$\w]`...": subpattern name expected (at offset 281))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(?<=^|{|,)\s*+(\b[_$a-zA-Z][$\w]`...": subpattern name expected (at offset 271))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(?<=^|{|,)\s*+(('|\")([^"']*)(\k`...": subpattern name expected (at offset 33))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(?<=^|{|,)\s*+(('|\")([^"']*)(\k`...": subpattern name expected (at offset 33))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(?<!:)\s*+(\bstatic\b)?\s*+(\bas`...": subpattern name expected (at offset 78))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(?<=^|{|,)\s*+(('|\")([^"']*)(\k`...": subpattern name expected (at offset 33))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`\s*+(\basync\b)?\s*+(?=(<(?:(?>[`...": subpattern name expected (at offset 240))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`\s*+(\b[_$a-zA-Z][$\w]*)\s*+(=)\`...": subpattern name expected (at offset 271))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`\s*+(\b[A-Z][$\w]*)?(\.)(prototy`...": subpattern name expected (at offset 304))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`\s*+(\b_?[A-Z][$\w]*)?(\.)([_$a-`...": subpattern name expected (at offset 291))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(^\s*+(?=([$\w]*\s*+\??\s*+(:|=(`...": nothing to repeat (at offset 73))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`\s*+((("|').*?(?<=[^\\])\k<-1>)|`...": subpattern name expected (at offset 27))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(^|:|;|=|(?<=:|;|=))\s*+(\((?=((`...": subpattern name expected (at offset 52))
    - [ ] Invalid regex in grammar: `source.js.jsx` (in `grammars/Babel Language.json`) contains a malformed regex (regex "`(?:^|;)\s*+(\bstatic\b)?\s*+(\ba`...": subpattern name expected (at offset 422))
[...]

On a side note, it would be nice to be able to query just a single grammar too.

I also noticed that the script doesn't exit if Docker isn't available/running:

$ script/add-grammar --replace language-babel https://github.com/gandm/language-babel
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
cp lib/linguist/heuristics.rb tmp/x86_64-darwin16/stage/lib/linguist/heuristics.rb
cp lib/linguist/vendor.yml tmp/x86_64-darwin16/stage/lib/linguist/vendor.yml
install -c tmp/x86_64-darwin16/linguist/2.4.0/linguist.bundle lib/linguist/linguist.bundle
cp tmp/x86_64-darwin16/linguist/2.4.0/linguist.bundle tmp/x86_64-darwin16/stage/lib/linguist/linguist.bundle
$

Want new issues for all of these points?

@vmg
Copy link
Contributor Author

vmg commented Dec 4, 2017

I've not tested with adding a new grammar, but I did test with replacing a grammar (our longest serving desirable)

I see. The compiler is outputting error messages, but I'm afraid that the add-grammar Ruby script is not very robust and discards stderr from the commands it runs. I'll send you a PR.

@kivikakk kivikakk mentioned this pull request Dec 7, 2017
@lildude lildude mentioned this pull request Jan 26, 2018
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants